Retrieving Images of Scanned Text Documents

Author

  • Alan F. Smeaton
Abstract

Information retrieval is the task of finding documents, usually text, which are relevant to a user's information need. A conventional approach to information management of paper documents is normally based on classifying them into a hierarchical classification structure. More recently we have seen electronic document management systems which manage scanned images of documents in the same way as paper, or which do some OCR to determine machine-readable document content which may then be used for content-based information retrieval. In this paper we present an alternative to full-scale OCR of scanned document images in which words in scanned images of documents are represented as word shape tokens (WSTs) representing the approximate shape of words in text. We describe an approach to WST-based information retrieval that can be entirely automatic or can involve the user in refining the retrieval process. To measure the effectiveness of WST-based retrieval we have implemented and refined it on two collections of documents/queries/relevance judgments, one in English, the other in French. Each of these has over 250 Mbytes of ASCII text and in indexing documents we have faithfully recreated the kind and frequency of errors expected in a WST recognition process. When compared to word-based retrieval, the effectiveness of WST-based retrieval is surprisingly good, though erratic across different queries. We outline some of the directions we are pursuing in order to address this inconsistency of performance across queries and our results show real potential for WST-based retrieval of scanned document images.
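The idea of a word shape token can be illustrated with a minimal Python sketch. The abstract does not give the actual set of shape codes, so the classes below (ascender, descender, x-height, and separate codes for "i" and "j") and the names `char_shape_code` and `word_shape_token` are assumptions chosen for illustration only, not the author's scheme.

```python
# Hypothetical character shape classes; the real code set used in the
# paper is not stated in the abstract and may well differ.
ASCENDERS = set("bdfhklt")   # letters rising above the x-height
DESCENDERS = set("gpqy")     # letters dropping below the baseline


def char_shape_code(ch: str) -> str:
    """Map one character to an (assumed) shape code."""
    if ch.isdigit() or ch.isupper() or ch in ASCENDERS:
        return "A"           # tall shapes
    if ch in DESCENDERS:
        return "g"           # shapes with a descender
    if ch == "i":
        return "i"           # dotted, x-height
    if ch == "j":
        return "j"           # dotted, with descender
    if ch.islower():
        return "x"           # all remaining x-height letters
    return ch                # punctuation and other symbols pass through


def word_shape_token(word: str) -> str:
    """Build the word shape token by coding each character in turn."""
    return "".join(char_shape_code(c) for c in word)


if __name__ == "__main__":
    for w in ("dog", "day", "image"):
        print(w, "->", word_shape_token(w))
```

Under this assumed mapping, "dog" and "day" collapse to the same token, which shows the kind of ambiguity a WST-based index has to tolerate and may help explain why effectiveness varies from query to query.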

Similar resources

Multimedia Indexing And Retrieval Research at the Center for Intelligent Information Retrieval

The digital libraries of the future will include not only (ASCII) text information but scanned paper documents as well as still photographs and videos. There is, therefore, a need to index and retrieve information from such multi-media collections. The Center for Intelligent Information Retrieval (CIIR) has a number of projects to index and retrieve multi-media information. These include: 1. Th...

Using Text Surrounding Method to Enhance Retrieval of Online Images by Google Search Engine

Purpose: The current research aimed to compare the effectiveness of various tags and codes for retrieving images from Google. Design/methodology: Selected images with different characteristics in a registered domain were carefully studied. The exception was that special conceptual features were assigned to each group of images separately. In this regard, each group image surr...

Curvature Correction and Shadow Images of Scanned Documents based on Boundary Lines of Text and Brightness Estimation Function

When the pages of a book or thick document are scanned, two types of distortion, geometric and optical, arise in the scanned images. As a result, lines of text in the document appear curved and the area near the book binding is shaded. This makes the page difficult to read. Attempts have been made to correct such images. In this paper, we review the methods for correcting the ...

Detecting Copy-Move Forgeries in Scanned Text Documents

The detection of copy–move forgeries has been studied extensively; however, all known methods were designed and evaluated for digital images depicting natural scenes. In this paper, we address the problem of detecting and localizing copy–move forgeries in images of scanned text documents. The purpose of our analysis is to study how block-based detection of near-duplicates performs in this applic...

User-Mediated Word Shape Tokens for Querying Document Images

Word Shape Tokens (WSTs) are tokens used to represent words based on the overall shape or contour of a word as it appears in printed text. A character shape code (CSC) mapping function is used to aggregate similarly shaped letters such as "g" and "y" into one single code to represent those letters. The rationale behind this is that it is far easier and more accurate to map a scanned image of a ...

Publication date: 1998